Smoothed Bloom Filter Language Models: Tera-Scale LMs on the Cheap
Abstract
A Bloom filter (BF) is a randomised data structure for set membership queries. Its space requirements fall significantly below lossless information-theoretic lower bounds, but it produces false positives with some quantifiable probability. Here we present a general framework for deriving smoothed language model probabilities from BFs. We investigate how a BF containing n-gram statistics can be used as a direct replacement for a conventional n-gram model. Recent work has demonstrated that corpus statistics can be stored efficiently within a BF; here we consider how smoothed language model probabilities can be derived efficiently from this randomised representation. Our proposal takes advantage of the one-sided error guarantees of the BF, and of simple inequalities that hold between related n-gram statistics, to further reduce both the BF storage requirements and the error rate of the derived probabilities. We use these models as replacements for a conventional language model in machine translation experiments.
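The one-sided error mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the class `BloomFilter`, the helpers `ngrams`, `build_filter` and `seen`, the SHA-256 based hashing, and the toy corpus are all assumptions made for illustration. The sketch stores every n-gram up to a fixed order and relies on one simple relation between related n-grams: if an n-gram occurred in the corpus, so did each of its contiguous sub-n-grams. Because a BF never returns a false negative, a single "absent" answer for any sub-n-gram is enough to reject the longer n-gram, which can only lower the false positive rate of the query.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: k salted hashes over an m-bit array (illustrative)."""

    def __init__(self, num_bits, num_hashes):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def ngrams(tokens, max_order):
    """Yield every contiguous n-gram of order 1..max_order as a tuple."""
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def build_filter(corpus, max_order, num_bits, num_hashes):
    """Insert all n-grams (up to max_order) of a whitespace-tokenised corpus."""
    bf = BloomFilter(num_bits, num_hashes)
    for sentence in corpus:
        for gram in ngrams(sentence.split(), max_order):
            bf.add(" ".join(gram))
    return bf


def seen(bf, gram):
    """Query an n-gram, rejecting it as soon as any sub-n-gram is absent.

    Assumes the filter was built with all orders up to len(gram), so every
    contiguous sub-n-gram of a stored n-gram is itself stored.
    """
    n = len(gram)
    for order in range(1, n + 1):
        for i in range(n - order + 1):
            if " ".join(gram[i:i + order]) not in bf:
                return False
    return True


# Hypothetical toy corpus, for illustration only.
corpus = ["the cat sat", "the dog sat"]
bf = build_filter(corpus, max_order=3, num_bits=4096, num_hashes=3)
print(seen(bf, ("the", "cat", "sat")))
```

Checking the shorter n-grams first costs extra lookups, but each lookup is only a handful of hash evaluations. The sketch covers membership only; storing counts and deriving smoothed probabilities from them are outside its scope.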
Related resources
A Cuckoo Filter Modification Inspired by Bloom Filter
Probabilistic data structures are so popular in membership queries, network applications, and so on. Bloom Filter and Cuckoo Filter are two popular space efficient models that incorporate in set membership checking part of many important protocols. They are compact representation of data that use hash functions to randomize a set of items. Being able to store more elements while keeping a reaso...
Morpheme level hierarchical Pitman-Yor class-based language models for LVCSR of morphologically rich languages
Performing large vocabulary continuous speech recognition (LVCSR) for morphologically rich languages is considered a challenging task. The morphological richness of such languages leads to high out-of-vocabulary (OOV) rates and poor language model (LM) probabilities. In this case, the use of morphemes has been shown to increase the lexical coverage and lower the LM perplexity. Another approach ...
Integrating High and Low Smoothed LMs in a CSR System
In Continuous Speech Recognition (CSR) systems, acoustic and Language Models (LM) must be integrated. To get optimum CSR performances, it is well-known that heuristic factors must be optimised. Due to its great effect on final CSR performances, the exponential scaling factor applied to LM probabilities is the most important. LM probabilities are obtained after applying a smoothing technique. ...
Routing on large scale mobile ad hoc networks using bloom filters
A bloom filter is a probabilistic data structure used to test whether an element is a member of a set. The bloom filter shares some similarities to a standard hash table but has a higher storage efficiency. As a drawback, bloom filters allow the existence of false positives. These properties make bloom filters a suitable candidate for storing topological information in large-scale mobile ad hoc...
Gradient Compared Lp-LMS Algorithms for Sparse System Identification
In this paper, we propose two novel p-norm penalty least mean square (lp-LMS) algorithms as supplements of the conventional lp-LMS algorithm established for sparse adaptive filtering recently. A gradient comparator is employed to selectively apply the zero attractor of p-norm constraint for only those taps that have the same polarity as that of the gradient of the squared instantaneous error, w...
Publication date: 2007